Release 10.1A: OpenEdge Development:
Internationalizing Applications


Creating and modifying word-break tables

OpenEdge provides a collection of word-break tables in the DLC/prolang/convmap directory. Figure 3–6 shows one of them, big5-bas.wbt. The filename reflects the table's code page, big-5, which is used for Traditional Chinese.

/*
 *
 * NAME: big5-bas.wbt
 * Progress Word Break Source File for codepage big-5
 *
 */

version = 9
codepage = big-5
wordrules-name = basic
type = 3

/* Special word break rules table */
word_attr = 
{
     '.',  BEFORE_DIGIT,   /* part of a word only if followed by a digit */
     ',',  BEFORE_DIGIT,
     '-',  BEFORE_DIGIT,
     ''',  IGNORE,         /* completely ignore it */
     '$',  USE_IT,         /* always part of a word */
     '%',  USE_IT,
     '#',  USE_IT,
     '@',  USE_IT,
     '_',  USE_IT
}; 

Figure 3–6: The big5-bas.wbt word-break table

Understanding word-delimiter attributes

The keywords BEFORE_DIGIT, IGNORE, and USE_IT, which appear in Figure 3–6, are word-delimiter attributes. Each word-delimiter attribute describes a word-break role played by a code page element. The complete set of word-delimiter attributes appears in Table 3–4.

Table 3–4: Word-delimiter attributes

BEFORE_DIGIT
  Description: Treated as part of a word only if followed by a character with the DIGIT attribute.
  Default: Assigned to the following characters:
  • Period (.)
  • Comma (,)
  • Hyphen (-)
  For example, "12.34" is one word, but "ab.cd" is two words.

BEFORE_LET_DIG
  Description: Treated as part of a word only if followed by a character with the LETTER or DIGIT attribute.

BEFORE_LETTER
  Description: Part of a word only if followed by a character with the LETTER attribute. Else, treated as a word delimiter.

DIGIT
  Description: Always part of a word.
  Default: Assigned to the characters 0–9.

IGNORE
  Description: Ignored.
  Default: Assigned to the apostrophe (').
  For example, "John's" is equivalent to "Johns."

LETTER
  Description: Always part of a word.
  Default: Assigned to all characters that the current attribute table defines as alphabetic. In English, these are the uppercase characters A–Z and the lowercase characters a–z.

TERMINATOR
  Description: Word delimiter.
  Default: Assigned to all other characters.

USE_IT
  Description: Always part of a word.
  Default: Assigned to the following characters:
  • Dollar sign ($)
  • Percent sign (%)
  • Number sign (#)
  • At symbol (@)
  • Underline (_)
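
The effect of these attributes can be modeled with a short Python sketch. This is only an illustration of the rules in Table 3–4, not the OpenEdge implementation: the names DEFAULT_ATTRS, attr_of, and split_words are hypothetical, LETTER and DIGIT are assumed to be the ASCII letters and digits, and the lookahead simply inspects the next raw character (including one that would itself be ignored).

```python
# Illustrative model of the Table 3-4 attributes; not OpenEdge's actual code.
DEFAULT_ATTRS = {
    ".": "BEFORE_DIGIT", ",": "BEFORE_DIGIT", "-": "BEFORE_DIGIT",
    "'": "IGNORE",
    "$": "USE_IT", "%": "USE_IT", "#": "USE_IT", "@": "USE_IT", "_": "USE_IT",
}


def attr_of(ch, attrs=DEFAULT_ATTRS):
    """Return the word-delimiter attribute for one character."""
    if ch in attrs:
        return attrs[ch]
    if ch.isascii() and ch.isdigit():
        return "DIGIT"
    if ch.isascii() and ch.isalpha():
        return "LETTER"
    return "TERMINATOR"  # all other characters delimit words


def split_words(text, attrs=DEFAULT_ATTRS):
    """Split text into words according to the attribute rules."""
    words, current = [], []
    for i, ch in enumerate(text):
        attr = attr_of(ch, attrs)
        # Attribute of the following character; end of text acts as a delimiter.
        nxt = attr_of(text[i + 1], attrs) if i + 1 < len(text) else "TERMINATOR"
        if attr == "IGNORE":
            continue  # dropped without ending the current word
        keep = (
            attr in ("LETTER", "DIGIT", "USE_IT")
            or (attr == "BEFORE_DIGIT" and nxt == "DIGIT")
            or (attr == "BEFORE_LETTER" and nxt == "LETTER")
            or (attr == "BEFORE_LET_DIG" and nxt in ("LETTER", "DIGIT"))
        )
        if keep:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(split_words("12.34"))   # ['12.34']
print(split_words("ab.cd"))   # ['ab', 'cd']
print(split_words("John's"))  # ['Johns']
```

Passing a different attrs dictionary to split_words is a quick way to preview how a modified table might split sample text before you edit a real word-break table.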

Word-break table syntax

Word-break behavior varies widely between and even within locales. If CONTAINS queries do not work as expected in a particular locale, you can copy a word-break table that OpenEdge provides and modify it as necessary. You can also create a word-break table from scratch. The syntax is as follows:

Syntax
[ #define symbolic-name symbol-value ] ... 
[ Version = 9 
   Codepage = codepage-name  
   wordrules-name = wordrules-name  
   type = 3 
] 
word_attr = 
{ 
  { char-literal | hex-value | decimal-value } , word-delimiter-attribute 
      [ , { char-literal | hex-value | decimal-value } 
          , word-delimiter-attribute ] ... 
}; 

symbolic-name

The name of a symbol. For example: DOLLAR-SIGN.

symbol-value

The value of the symbol. For example: '$'.

Note: Although OpenEdge and some versions of Progress let you compile word-break tables that omit all items within the second pair of square brackets (the Version, Codepage, wordrules-name, and type settings), Progress Software Corporation recommends that you always include these items. If the source-code version of a compiled word-break table lacks these items, and the associated database is not so large as to make this impractical, Progress Software Corporation recommends that you add these items to the table, recompile the table, reassociate the table with the database, and rebuild the indexes.

codepage-name

The name, not surrounded by quotes, of the code page the word-break table is associated with. The maximum length is 20 characters. For example: UTF-8.

wordrules-name

The name, not surrounded by quotes, of the compiled word-break table. The maximum length is 20 characters. For example: utf8sample.

type = 3

The table type. Although OpenEdge supports existing word-break tables of type 1 and type 2, Progress Software Corporation recommends you change their table type to 3. If you do, you also must recompile the word-break table, reassociate it with the database, and rebuild the indexes.

char-literal

A character within single quotes or a symbolic-name, which represents a character in the code page. For example: '#'.

hex-value

A hexadecimal value or a symbolic-name, which represents a character in the code page. For example: 0xAC.

decimal-value

A decimal value or a symbolic-name, which represents a character in the code page. For example: 39.
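
The three ways of naming a character can be compared with a small Python sketch. The function entry_code and its symbols argument are hypothetical names used only for illustration. The sketch assumes ASCII-range characters, where the quoted character's code is the same in most code pages, and it does not reproduce how the word-break table compiler actually parses entries.

```python
def entry_code(token, symbols=None):
    """Resolve one word_attr character token to an integer code value."""
    symbols = symbols or {}
    token = symbols.get(token, token)  # expand a #define symbolic name, if any
    if len(token) == 3 and token[0] == token[-1] == "'":
        return ord(token[1])           # char-literal such as '#'
    if token.lower().startswith("0x"):
        return int(token, 16)          # hexadecimal value such as 0xAC
    return int(token, 10)              # decimal value such as 39


print(entry_code("'#'"))   # 35
print(entry_code("0xAC"))  # 172
print(entry_code("39"))    # 39
print(entry_code("'''"))   # 39, the apostrophe
print(entry_code("DOLLAR-SIGN", {"DOLLAR-SIGN": "'$'"}))  # 36
```

In this model, the decimal value 39 and the character literal ''' resolve to the same character, the apostrophe.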

word-delimiter-attribute

The context in which the character acts as a word delimiter or as part of a word. Use one of the word-delimiter attributes in Table 3–4.
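
As a rough illustration of how the pieces of the syntax fit together, the following Python sketch writes out word-break table source text from a list of entries. The function render_wbt is a hypothetical helper, not part of OpenEdge. It uses the lowercase keyword spelling shown in Figure 3–6, checks only the 20-character limit described above, and makes no claim that the compiler accepts every table it produces.

```python
def render_wbt(codepage, rules_name, entries):
    """Return word-break table source text for (character, attribute) pairs."""
    # Both names are limited to 20 characters, per the descriptions above.
    for label, value in (("codepage", codepage), ("wordrules-name", rules_name)):
        if len(value) > 20:
            raise ValueError(f"{label} exceeds 20 characters: {value!r}")
    lines = [
        "version = 9",
        f"codepage = {codepage}",
        f"wordrules-name = {rules_name}",
        "type = 3",
        "",
        "word_attr =",
        "{",
    ]
    # One entry per line, comma-separated, with no comma after the last entry.
    body = [f"     '{ch}',  {attr}" for ch, attr in entries]
    lines.append(",\n".join(body))
    lines.append("};")
    return "\n".join(lines) + "\n"


sample = render_wbt("UTF-8", "utf8sample", [("$", "USE_IT"), ("'", "IGNORE")])
print(sample)
```

The sample call reuses the example names from the descriptions above (UTF-8 and utf8sample). The resulting text would still need to be compiled, associated with the database, and followed by an index rebuild before it affects CONTAINS queries.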


Copyright © 2005 Progress Software Corporation